Case Study on LLVM as suitable intermediate language for binary analysis

Author

  • Florian Märkl
Abstract

Instead of working directly on code, many binary analysis tools and compilers use an intermediate representation of it. The idea of this thesis is to use the well-tested intermediate representation from LLVM for binary analysis tasks. We take a look at McSema, a tool that translates x86 and x86_64 binaries to LLVM, describe its translation process in detail, and additionally implement Python bindings for it. To test McSema in practice, we present five examples of code that we translate to LLVM and then recompile. The last of these demos is an example of using KLEE, a symbolic execution engine for LLVM, on the code produced by McSema in order to successfully solve a CrackMe. We conclude that McSema’s translation approach provides a suitable way to extract functions from binaries in order to integrate them into other code or to analyse them using symbolic execution, and that it can serve as a potential basis for implementing an LLVM-based decompiler. We also compare it to Remill, a tool similar to McSema that generates code representing the assembly code more explicitly, and to VEX, the intermediate representation used in Valgrind and angr, which is likewise closer to the machine code.


Similar resources

Enabling sophisticated analyses of x86 binaries with RevGen

Current state-of-the-art static analysis tools for binary software operate on ad-hoc intermediate representations (IR) of the machine code. Therefore, even though IRs facilitate program analysis by abstracting away the source language, it is hard to reuse existing implementations of analysis tools in new endeavors. Recently, a new compiler framework, LLVM, has emerged, together with many analys...


LLVM Optimizations for PGAS Programs. Case Study: LLVM Wide Pointer Optimizations in Chapel

PGAS programming languages such as Chapel, Coarray Fortran, Habanero-C, UPC and X10 [3–6, 8] support high-level and highly productive programming models for large-scale parallelism. Unlike message-passing models such as MPI, which introduce nontrivial complexity due to message-passing semantics, PGAS languages simplify distributed parallel programming by introducing higher-level parallel languag...


Decompilation to Compiler High IR in a binary rewriter

A binary rewriter is a piece of software that accepts a binary executable program as input and produces an improved executable as output. This paper describes the first technique in the literature to decompile the input binary into an existing compiler’s high-level intermediate representation (IR). The compiler’s back-end is then used to generate the output binary from the IR. Doing so enables the use of th...


Automatic Generation of Assembly to IR Translators Using Compilers

Translating low-level machine instructions into a higher-level intermediate representation (IR) is one of the central steps in many binary translation, analysis and instrumentation systems. Most of these systems manually build the machine-instruction-to-IR mapping table needed for such a translation. As a result, these systems often suffer from two problems: (a) a great deal of manual effort is r...


Metamorphic Code from LLVM IR Bytecode

Metamorphic software changes its internal structure across generations with its functionality remaining unchanged. Metamorphism has been employed by malware writers as a means of evading signature detection and other advanced detection strategies. However, code morphing also has potential security benefits, since it can serve to increase the “genetic diversity” of software. We have created a me...




Publication date: 2017